WebSets : Unsupervised Information Extraction approach to Extract Sets of Entities from the Web

نویسندگان

  • Bhavana Dalvi
  • William Cohen
  • Jamie Callan
چکیده

We propose an unsupervised information extraction system, which exploits the structured information in the form of HTML tables to build meaningful sets of entities belonging to certain categories. Due to redundancy on the Web, we believe that entities belonging to important categories will frequently co-occur in table columns. We present a clustering algorithm to cluster such frequently occurring entities into meaningful sets. Then we find candidate category names for each set, using Hearst patterns like “X such as Y”, “X including Y”. Experimental results on four different datasets show that our method can extract meaningful sets of entities (with avg. cluster precision of 97-99%). It also proposes reasonable category names for them. We present an application of this method to enhance an existing knowledge base. Experiments show that our method improves the coverage of existing categories with 80-90% accuracy. It also suggests new categories that can be added to the knowledge base.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

متن کامل

Towards Supporting Exploratory Search over the Arabic Web Content: The Case of ArabXplore

Due to the huge amount of data published on the Web, the Web search process has become more difficult, and it is sometimes hard to get the expected results, especially when the users are less certain about their information needs. Several efforts have been proposed to support exploratory search on the web by using query expansion, faceted search, or supplementary information extracted from exte...

متن کامل

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

متن کامل

Review Paper on Named Entity Recognition and Attribute Extraction using Machine Learning

Named entity recognition (NER) is a subsidiary task under information extraction that aims at locating and classifying named entities in the text provided into pre-defined categories such as the names of people, locations, organizations, etc. In focused NER, once the entities are recognized we further aim at finding the most important named entities among all the others in a document, which we ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011